ABSTRACT

It is shown that the two-part Minimum Description Length Principle can be used to discriminate 
among different models that can explain a given observed dataset. The description length is chosen 
to be the sum of the lengths of the message needed to encode the model plus the message needed to 
encode the data when the model is applied to the dataset. It is verified that the proposed principle 
can efficiently distinguish the model that correctly fits the observations while avoiding over-fitting. 
The capabilities of this criterion are shown in two simple problems for the analysis of observed spec- 
tropolarimetric signals. The first is the de-noising of observations with the aid of the PCA technique. 
The second is the selection of the optimal number of parameters in LTE inversions. We propose 
this criterion as a quantitative approach for distinguishing the most plausible model among a set of
proposed models. This quantity is very easy to implement as an additional output on the existing 
inversion codes. 

Subject headings: polarization — methods: data analysis, statistical, numerical 
1. INTRODUCTION

When a scientist tries to analyze a given observed data 
set, it is customary to begin by examining the data in 
various ways, such as plotting the data and looking for 
patterns. In a second step, the scientist proposes a num- 
ber of physically plausible models that can reproduce 
the observed data set. A model fitting procedure is then 
applied to the data set so that the parameters that char- 
acterize each model are inferred. After all the proposed 
models are fitted to the data, the next step is to compare 
them and infer which of the fitted models is the most suit- 
able. Methods for such a task have been developed. For 
instance, for models that represent a hierarchical struc-
ture, we can use Akaike's Information Criterion (Akaike)
or Mallows' $C_p$ (Mallows). These methods
can be applied even in the case that the set of proposed 
models does not contain the perfect model. In this case, 
the aim is to select the most optimal one. If the mod- 
els are of completely different type and do not belong to 
a hierarchical structure of models, cross-validation type 
methods can be applied. However, they can be compu- 
tationally demanding. 

The standard procedure to select the optimum model is 
to make use of the Occam's Razor Principle (also known 
as the Principle of Parsimony). This principle is com-
monly applied in science and is usually thought to be
a heuristic approach for eliminating unnecessarily com-
plex hypotheses. The principle states that the selected
model has to provide an equilibrium between the model 
complexity and its fidelity to the data. Nevertheless, de- 
termining the simplest model is often very complicated. 
It is usually argued that the number of parameters that 
parameterizes the model should be smaller than the num- 
ber of degrees of freedom of the data. However, it is a 
difficult matter to estimate the number of degrees of free- 
dom of the data. 

An alternative and successful procedure is to consider 
the problem in terms of a communication process. As-
sume that a sender S is interested in sending a given ob-
servation to a receiver R. Several techniques are available
for such a task. The most trivial one is to send the whole 
amount of data through a given channel. If the data set is 
very large, the length of the message will be consequently 
very large. If S is able to model the observation using a 
given model, it is wiser and shorter to transmit the model 
followed by the points in the observation that are not 
correctly reproduced by the model. If the model is good 
and simple enough, the length of the message between S 
and R will be much shorter than sending the complete 
observed data set. This compression will be degraded 
when the model needed for explaining the observations 
is made unnecessarily complex. Rissanen (1978) was the 
first to suggest that the code length could be used for
model comparison. This principle is nowadays known 
as the Minimum Description Length (MDL) Principle, 
which states that we should choose the model that gives 
the shortest description of the data. In this framework, 
there is an interplay between models that are concise and 
easy to describe and models that produce a good fit to 
the data set and capture the important features evident
in the data. Of course, this is neither the only nor the 
best strategy for such purposes. However, its application 
is desirable in comparison with ad hoc or trial-and-error
ways of performing model selection. The reason is that
the MDL principle has strong theoretical roots in
Kolmogorov complexity theory (Vitányi & Li 2000;
Gao et al. 2000). As we will show, the practical applica-
tion of the MDL principle is very easy to implement and
it constitutes an ideal approach for model selection. 

This paper focuses on one version of the
MDL principle, the so-called two-part MDL. This strat-
egy was developed by Rissanen in a series of papers pub-
lished during the seventies and eighties (e.g., Rissanen
1978, 1983, 1986) and summarized in Rissanen (1989).
We give a brief summary of the main results in Section 2.
As shown below, the main idea is to write the description 
length of a given model applied to a data set as the sum
of the length of the code for describing the model and 
the length of the code for describing the data set fitted 
by the model. 

The interpretation of spectropolarimetric data in so-
lar and stellar observations allows us to infer information
about the properties of the magnetic field. Almost al-
ways, the recovery of the magnetic field vector is based
on the assumption of a model. These models depend
on a given number of variables that can be estimated
by fitting the observed Stokes profiles. Among these
models, we can find the Milne-Eddington approxima-
tion, the LTE approximation, the MISMA hypothesis
(Sánchez Almeida et al. 1996), etc. Sometimes, it is clear
that a model is the most appropriate when observational 
clues are available. However, such clues are often
absent and the choice of model rests on
completely or partially subjective reasons.

Some of these models are based on an extraordinar-
ily large number of parameters. To date, there has
been no critical investigation of whether
enough information is available in the observations to
constrain such a large number of parameters. This work
is a first step in this direction. We propose the applica-
tion of the MDL criterion to quantitatively differentiate
between possible models taking into account the infor-
mation available in the observations.

2. MINIMUM DESCRIPTION LENGTH 



Rissanen (1978) related the problem of finite dimen-
sional parameter estimation to the problem of designing
an optimal encoding scheme. We consider the problem 
of transmitting a set of data from a sender S to a re- 
ceiver R. The sender must first communicate the type of 
model that can be used for describing the data. Con- 
sider, for instance, a set of points that we want to fit 
with a polynomial model. In this case, neither the order 
of the polynomial $k$, nor the value of the parameters of
the polynomial $\{a_i, i = 0, \ldots, k\}$ are known, so that this
constitutes an extremely ill-posed problem. However, the 
MDL principle can be used to find the "most plausible" 
model that explains the observed points. In the first step, 
the sender has to communicate the order of the polyno- 
mial k. When the sender and the receiver agree on the 
type of model to use, the sender has to communicate the 
model itself, sending through the channel the $k+1$ param-
eters $a_i$. If noise is present or if the proposed model is
incomplete, the sender has to additionally communicate 
the deviations of the model from the data. 

The question of simplicity can consequently be tack-
led by taking advantage of the work of Rissanen (1978).
Each model is reduced to bits, the most fundamental in-
formation unit. This transforms Occam's
Razor into a fully operational principle.
The measure of simplicity is just the number of bits re-
quired to transmit a set of ob-
servations correctly and unambiguously by using a model. The sender S takes a set
of observations as input, encodes and sends a message 
that contains all the information about the model and 
the data to the receiver R. Finally, the receiver decodes 
the message and produces an output. In information- 
preserving encoding schemes, the output obtained by R 
has to be identical to the original observation performed 
by S. Let $D$ be a set of observations (dataset) and $M$
a model that is used to describe them. The quantity
$L(M)$ represents the length of the code in bits necessary
to encode the model $M$. Likewise, $L(D|M)$ represents
the length of the data encoded using the model $M$ (this
term can be alternatively seen as the residual between
the data $D$ and the model $M$). The total length of the
message is:

$$L = L(M) + L(D|M). \tag{1}$$

The MDL principle tries to minimize L and the model 
associated with this minimum length is selected as the 
most plausible model. 

The previous analysis can be alternatively viewed from
the Bayesian perspective (see, e.g., Gao et al. 2000). The
aim is to infer a model $M$ from a set of observations
$D$. The solution lies in choosing the model that max-
imizes the posterior probability $p(M|D)$, which can be
expressed as follows with the aid of Bayes' theorem:

$$p(M|D) = \frac{p(D|M)\,p(M)}{p(D)}. \tag{2}$$



The term $p(D)$ can be considered as a normalizing fac-
tor and represents the probability that the data set $D$
occurs. The term $p(M)$ is the a priori probability of
the model, that is, the probability that the model $M$
is true before any data set has been observed. Finally,
$p(D|M)$ is the likelihood of the data given the model
$M$. The relationship between the MDL formalism given
by Eq. (1) and the Bayes formalism given by Eq. (2)
is obtained by making use of Shannon's optimal coding
theorem (Shannon 1948a,b). This theorem states that
the length of the ideal code for a value $x$ of a variable
$X$ which follows a known probability distribution
$p(X)$ is given by:

$$L(x) = -\log p(X = x). \tag{3}$$

Taking the negative logarithm of Eq. (2), we obtain:

$$-\log p(M|D) = -\log p(D|M) - \log p(M) + \log p(D). \tag{4}$$

The most plausible model is the one that mini-
mizes $-\log p(M|D)$, that is, the one that minimizes
$-\log p(D|M) - \log p(M)$. Note that we have ignored
the influence of $p(D)$ since it is a constant that is shared
by all the models and it is only associated with the data
set. According to Shannon's theorem, the minimization
of Eq. (4) is equivalent to the minimization of Eq. (1).
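
The equivalence between the shortest two-part message and the largest posterior probability can be checked numerically. The following Python sketch is only an illustration: the three model names, their prior probabilities and their likelihoods are hypothetical numbers chosen here, and the helper `code_length_bits` is a name introduced for this example. It applies Shannon's code length with base-2 logarithms to each term.

```python
import math

def code_length_bits(probability):
    """Shannon code length in bits of an outcome with the given probability."""
    return -math.log2(probability)

# Hypothetical priors p(M) and likelihoods p(D|M) for three candidate models.
priors = {"simple": 0.6, "medium": 0.3, "complex": 0.1}
likelihoods = {"simple": 1e-6, "medium": 1e-3, "complex": 2e-3}

# Two-part message length L = L(M) + L(D|M) for each model.
total_length = {
    name: code_length_bits(priors[name]) + code_length_bits(likelihoods[name])
    for name in priors
}

# Unnormalized posterior p(D|M) p(M); the constant p(D) is dropped.
posterior = {name: likelihoods[name] * priors[name] for name in priors}

mdl_choice = min(total_length, key=total_length.get)
map_choice = max(posterior, key=posterior.get)
print(mdl_choice, map_choice)  # both select "medium"
```

Because the logarithm is monotonic, the two selections coincide for any choice of positive priors and likelihoods, not only for the numbers assumed above.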

2.1. Code Length Formulae 

Shannon's theorem states the length of the optimal 
code needed for transmitting a given number x whose 
probability distribution is known. Although extremely 
important, the theorem can be useless because this prob- 
ability distribution is usually not known and only approx- 
imate values of the length of the message can be obtained. 
An example is when one needs to transmit a set of integer 
or real numbers for which no probability distribution is 
known. It has been demonstrated that knowing the
encoding scheme exactly is of secondary importance (Rissanen
1978). What is fundamental to know is which model
gives the minimal length of the message given an arbi-
trary encoding scheme. In the following, we assume that
the probability distribution is not known and we present
existing estimations for the length of the message needed
for communicating integer and real numbers. These for-
mulae represent the first estimations that were obtained
for the lengths of the message for communicating models
(Rissanen 1978). Since they are not based on any prob-
ability distribution for the integer or real numbers, they
are known as universal priors (Rissanen 1978).

Assume that we want to encode an integer number n 
in its binary representation. The binary representation 
of n has a length that can be estimated to be of the order 
of log2 n bits. If a set of integers is to be encoded, con-
fusion arises if we pack all these binary digits together
because the receiver does not know where the represen-
tation of the first integer ends and that of the second starts.
To resolve this, one can also encode the length of the bi-
nary representation of n in binary representation, whose
length can also be estimated to be log2 log2 n. The same
problem arises again since the receiver does not know 
the length of this preamble representation. This prob- 
lem can be solved by giving the length of the preamble 
as log2 log2 log2 n as another preamble. This procedure 
can be iterated until log2 . . . log2 n is as close to zero (and 
positive) as desired. The integer number can be encoded 
with a scheme that includes these preambles, so that the 
total length of the code is given by: 

$$L(n) = \log^* n = \log_2 c + \log_2 n + \log_2 \log_2 n + \cdots \tag{5}$$

where the constant c ≈ 2.865 is included for consistency
(see, e.g., Rissanen 1983). The symbol log* n is chosen
to represent the previous sum. For large numbers, the
dominant factor is log2 n, so that we can approximately
assume that the length of the message for encoding an
integer number is of the order of log2 n for sufficiently
large n (Rissanen 1983). For encoding integers with sign,
we can add a single bit for setting the sign, so that the 
length of the message is log* n + 1 . The dominant factor 
is again log2 n. 
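
The iterated-preamble sum can be evaluated with a few lines of Python. This is a minimal sketch, assuming that the sum keeps only the positive terms and that c = 2.865; the function names `log_star` and `signed_log_star` are introduced here for illustration.

```python
import math

def log_star(n, c=2.865):
    """Universal code length in bits of a positive integer n:
    log2(c) + log2(n) + log2(log2(n)) + ..., keeping only positive terms."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    total = math.log2(c)
    term = math.log2(n)
    while term > 0:
        total += term
        term = math.log2(term)
    return total

def signed_log_star(n, c=2.865):
    """Code length of a signed non-zero integer: one extra bit for the sign."""
    return log_star(abs(n), c) + 1

# For n = 16 the positive terms are 4, 2 and 1, so the sum is log2(c) + 7.
print(log_star(16))
```

For n = 16 the leading term log2 n = 4 already accounts for more than half of the sum of the iterated terms, which illustrates why log2 n dominates for large n.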

Encoding real numbers is more complicated than inte- 
ger numbers because we would need an infinite number 
of bits to encode the real number to infinite precision. To 
solve this in practice, we have to encode the real number
$x$ assuming a precision $\delta$, so that the encoded number $x_\delta$
fulfills $|x - x_\delta| < \delta$. Once the precision is fixed, we can
encode the integer part $\lfloor x_\delta \rfloor$ and the fractional part of
$x_\delta$ separately. It can be shown (Rissanen 1978) that the
length of the message is $L(x_\delta) = \log^* \lfloor x_\delta \rfloor + \log^* (1/\delta)$.
As stated before, if $x$ is large and taking into account
that $x \approx x_\delta$, we can estimate the length of the message
for encoding a real number $x$ with

$$L(x) \approx \log_2 x - \log_2 \delta. \tag{6}$$



Note that the length tends to zero when the number to 
encode approaches the precision. In the limit situation 
that X is smaller than the precision, the length takes a 
negative value that does not have meaning. 
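
The behaviour of this approximate length near the precision limit is easy to verify. The short Python sketch below is an illustration only; the function name `real_code_length` and the sample values are assumptions made here.

```python
import math

def real_code_length(x, delta):
    """Approximate code length in bits of a positive real x stored with
    precision delta: log2(x) - log2(delta)."""
    if x <= 0 or delta <= 0:
        raise ValueError("x and delta must be positive")
    return math.log2(x) - math.log2(delta)

print(real_code_length(1024.0, 0.25))  # 10 + 2 = 12.0 bits
print(real_code_length(0.5, 0.5))      # 0.0 bits: the number equals the precision
print(real_code_length(0.25, 0.5))     # negative: below the precision, no meaning
```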

With the previous considerations in mind, the model 
selection problem can be established. Let $\mu$ be the model
that generates the data or the one that is most plausible,
and assume that it belongs to a class $\mathcal{M}$ of $m$ models. Con-
sider that each model is parameterized by $k_i$ parameters
$\theta_{ij}$, with $1 \le i \le m$ and $1 \le j \le k_i$. Given a set
of observations, our aim is to choose the most plausible
model $\mu$ from the set $\mathcal{M}$ and to estimate the parame-
ters $\{\theta_{ij}, 1 \le j \le k_i\}$. The sender splits up the mes-
sage in two parts. The first one contains the model itself
and the second one contains the departures between the
model and the data. We transmit the information about
the model by first encoding the number of parameters
that characterizes it and then transmitting the parame-
ters themselves with a given precision $\delta_j$ associated with
each one:

$$L(M) = \log^* k_i + \sum_{j=1}^{k_i} \left[ \log^* \theta_{ij} + \log^* (1/\delta_j) \right]. \tag{7}$$

Since the log* n function behaves as log2 n for sufficiently
large n (Rissanen 1983), we can transform the previous
formula to:

$$L(M) \approx \log_2 k_i + \sum_{j=1}^{k_i} \left[ \log_2 \theta_{ij} - \log_2 \delta_j \right]. \tag{8}$$



The previous equation gives the length of the model
for encoding the parameters with arbitrary precision $\delta_j$.
However, it makes no sense to increase the precision of
the parameters unnecessarily because there may be no
information in the data for such a task. It has been
demonstrated by Rissanen (1989) that if the optimum
parameters are computed from a large set of $n$ observed
data points, the precision of the parameters can be ef-
fectively encoded with only $\frac{1}{2} \log_2 n$ bits. The reason for
this is that, if the parameters are being estimated from
the data, it makes no sense to encode them with a preci-
sion finer than the standard error of the estimation. For
a typical estimation of the $\theta_{ij}$ parameter, the standard
error decreases with the number of data points as $1/\sqrt{n}$.
Therefore, the precision term $-\log_2 \delta_j$ for a real number
encoded with precision $1/\sqrt{n}$ is $\frac{1}{2} \log_2 n$. If all the points are
used for estimating all the parameters, we can rewrite
Eq. (8) as:

$$L(M) \approx \log_2 k_i + \sum_{j=1}^{k_i} \log_2 \theta_{ij} + \frac{k_i}{2} \log_2 n. \tag{9}$$
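
As a worked instance of the precision term, take n = 512 data points, the sampling used in the example of this paper. The numbers below follow directly from setting the precision equal to the standard error scaling $1/\sqrt{n}$:

```latex
% Worked instance of the precision term, assuming n = 512 data points
\delta_j = \frac{1}{\sqrt{n}} = \frac{1}{\sqrt{512}} \approx 0.0442,
\qquad
-\log_2 \delta_j = \tfrac{1}{2}\log_2 512 = 4.5\ \text{bits per parameter},
\qquad
\frac{k_i}{2}\log_2 n = 4.5\,k_i\ \text{bits}.
```

Each additional parameter therefore costs a fixed 4.5 bits of precision at this sample size, on top of the bits needed for its magnitude.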



Additionally, we have to encode the observed data
given the model to obtain the length $L(D|M)$. In the
simple case in which we do not have any information
about the probability distribution of the data given the
model, we can apply the same encoding scheme we have
applied for describing the model. Assume that we have $n$
observed data points $y_j$ and that the model $f$ produces
a fit to the observations such that the residual can be
written as:

$$r_j = |y_j - f(x_j)|. \tag{10}$$

The encoding is performed by saving the number of data
points and the value of each data point by using Eq. (6).
The length $L(D|M)$ is:

$$L(D|M) = \log^* n + \sum_{j=1}^{n} \left[ \log^* r_j + \log^* (1/\delta_j) \right], \tag{11}$$

which can be simplified if the number of data points $n$ is
large enough to give:

$$L(D|M) \approx \log_2 n + \sum_{j=1}^{n} \left[ \log_2 r_j - \log_2 \delta_j \right]. \tag{12}$$
Putting together Eqs. (7) and (11), the total two-part
length of the message is:

$$L = \log^* k_i + \sum_{j=1}^{k_i} \left[ \log^* \theta_{ij} + \log^* (1/\delta_j) \right] + \log^* n + \sum_{j=1}^{n} \left[ \log^* r_j + \log^* (1/\delta_j) \right]. \tag{13}$$

If the number of data points $n$ is large enough, we can
use Eqs. (8) and (12) to obtain:

$$L \approx \log_2 k_i + \sum_{j=1}^{k_i} \left[ \log_2 \theta_{ij} - \log_2 \delta_j \right] + \log_2 n + \sum_{j=1}^{n} \left[ \log_2 r_j - \log_2 \delta_j \right]. \tag{14}$$

2.2. Computer-oriented Code Lengths



The previous equations are considered for an opti-
mum encoding scheme. However, a simpler and more
computer-oriented estimation, based
on the previous results, also gives good results (Gao et al.
2000). We assume that integer numbers are saved in
a computer using $l_i$ bits, while real numbers are saved
in the memory of the computer using $l_r$ bits. Usually,
integer numbers are saved using $l_i = 16$ bits, while real
numbers can be stored using either $l_r = 32$ or $l_r = 64$
bits depending on the desired precision. Therefore, we
can estimate the length of the message that describes the
model (the number of bits required for its storage in the
memory of the computer) by:

$$L(M) \approx k_i l_r, \tag{15}$$

where the number of bits required for transmitting the
number of parameters given by $\log^* k_i$ can be usually ne-
glected with respect to the length of the message used
for transmitting the parameters themselves. Concerning
the storage of the data, we consider that the model out-
put is correct when the residual is smaller than a given
precision threshold $\delta$. If the model output is incorrect,
we store the difference as a real number with $l_r$ bits, so
that the length is given by:

$$L(D|M) \approx \sum_{r_j > \delta} l_r. \tag{16}$$

The length of the parameters associated with the model
increases linearly with the number of parameters, while
the length of the data set decreases, with a slope that
depends on how well the model fits the data.
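
The computer-oriented lengths reduce to simple counting. The Python sketch below is one possible implementation, assuming 32-bit reals; the function names and the sample residuals are introduced here for illustration and are not taken from any existing code.

```python
def model_length_bits(n_params, real_bits=32):
    """Bits needed to store n_params real-valued parameters."""
    return n_params * real_bits

def data_length_bits(residuals, threshold, real_bits=32):
    """Bits needed to store every residual that exceeds the threshold."""
    n_bad = sum(1 for r in residuals if abs(r) > threshold)
    return n_bad * real_bits

def total_length_bits(n_params, residuals, threshold, real_bits=32):
    """Two-part message length: model term plus data term."""
    return (model_length_bits(n_params, real_bits)
            + data_length_bits(residuals, threshold, real_bits))

# Hypothetical residuals of a 3-parameter fit; two of them exceed 0.1.
residuals = [0.01, 0.30, 0.02, 0.45, 0.05]
print(total_length_bits(3, residuals, threshold=0.1))  # 3*32 + 2*32 = 160
```

Adding a parameter pays off only if it brings at least one more point below the threshold, since both cost the same number of bits in this scheme.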

2.3. Known distribution 

The previous encoding schemes have been obtained us- 
ing universal priors, thus assuming no knowledge at all 
about the distribution of the values of the data and/or 
model. However, if some information about the proba- 
bility distribution is known, it can be incorporated in the 
description of the encoding length through Eq. (3). A
typical case is when the observed data is contaminated
with noise described by a Gaussian distribution with zero
mean and a given standard deviation $\sigma$ (that may even
be unknown). For the set of $n$ observations $y_j$, the prob-
ability distribution of the residuals is given by:

$$p(\mathbf{r}) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left( -\frac{r_j^2}{2\sigma_j^2} \right), \tag{17}$$

where $\mathbf{r}$ is the vector of $n$ residuals $(r_1, r_2, \ldots, r_n)$ and
$\sigma_j^2$ is the variance for each data point. Assuming the same
variance for all the data points and using Shannon's
theorem, we obtain:

$$L(D|M) = \frac{n}{2} \log (2\pi\sigma^2) + \frac{\mathrm{RSS}}{2\sigma^2}, \tag{18}$$

where $\mathrm{RSS} = \sum_{j=1}^{n} r_j^2$ is the residual sum of squares.

The previous equation constitutes the estimation of the
length for communicating the data set when we have an
estimation for the value of $\sigma$. If the variance is not known
(Rissanen), we can use the maximum likelihood es-
timation $\sigma^2 \approx \mathrm{RSS}/n$ and obtain:

$$L(D|M) = \frac{n}{2} + \frac{n}{2} \log \frac{2\pi}{n} + \frac{n}{2} \log \mathrm{RSS}. \tag{19}$$

2.4. Example
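
The Gaussian code lengths for known and unknown variance are straightforward to evaluate once the residuals are available. The Python sketch below is a minimal illustration with hypothetical residuals; the function names are introduced here. It uses natural logarithms, so the lengths come out in nats and can be divided by ln 2 to convert them to bits.

```python
import math

def gaussian_length_known_sigma(residuals, sigma):
    """Code length (nats) for Gaussian residuals with known sigma:
    (n/2) log(2 pi sigma^2) + RSS / (2 sigma^2)."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return 0.5 * n * math.log(2.0 * math.pi * sigma ** 2) + rss / (2.0 * sigma ** 2)

def gaussian_length_ml_sigma(residuals):
    """Code length (nats) with the maximum likelihood estimate sigma^2 = RSS/n:
    n/2 + (n/2) log(2 pi / n) + (n/2) log(RSS). Requires RSS > 0."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return 0.5 * n + 0.5 * n * math.log(2.0 * math.pi / n) + 0.5 * n * math.log(rss)

# Hypothetical residuals; plugging sigma^2 = RSS/n into the known-sigma
# formula must reproduce the maximum likelihood expression.
residuals = [0.1, -0.2, 0.05, 0.3]
sigma_ml = math.sqrt(sum(r * r for r in residuals) / len(residuals))
print(gaussian_length_known_sigma(residuals, sigma_ml))
print(gaussian_length_ml_sigma(residuals))
```

The two printed values agree, which is a quick consistency check between the two formulae.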



For demonstrating the previous machinery in a simple
practical problem, let us assume that we have a noisy
linear combination of sinusoidal signals:

$$f(x) = 2 \sin(\pi x) + \sin(3\pi x) - \sin(4\pi x) - 2\cos(8\pi x) + \cos(14\pi x) + \epsilon. \tag{20}$$

The variable $x$ is always inside the interval [0, 1] and $\epsilon$
is a noise term with zero mean and standard deviation
$\sigma$. Let us consider that we have sampled the $x$ axis in
512 points and that $\sigma = (\max(f) - \min(f))/8$. The
frequencies of the signals can be obtained by performing
the Fast Fourier Transform of the signal. However, for
applying the MDL principle, we consider models of the
type

$$f(x) = a_0 + \sum_{j=1}^{p} \left\{ a_j \cos(j\pi x) + b_j \sin(j\pi x) \right\}. \tag{21}$$

The aim is to obtain the most plausible value of $p$ that
produces the smallest encoding length of the model and
data given the model. Firstly, we apply the code length
described in §2.2 using a threshold equal to the standard
deviation of the data. The results are shown in the left
panel of Fig. 1. Note that the number of bits necessary
to encode the model increases linearly. The number of
bits to describe the residual between the model and the
observed data decreases rapidly when $p < 10$. The sum
of both terms has a minimum around $p \approx 14$-$15$, which
coincides with the maximum value of $p$ in the original
signal.

The right panel shows the results obtained when we
take into account that the residuals between the model
and the data are well characterized by a Gaussian dis-
tribution and that the number of points is sufficiently
large. In this case, the length of the model is given by
Eq. (9), while the length of the data is given by the
Gaussian code length of §2.3.
The minimum of the total length gives the most plausible
value of $p \approx 14$-$15$, compatible with the original data
and with the value given by the previous estimation of
the total encoding length.
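
The selection of the number of harmonics can be reproduced in a simplified setting. The Python sketch below is a toy variant and not the experiment of this section: it assumes a periodic basis with arguments $2\pi j x$ on a uniform grid, for which the least-squares coefficients are plain projections and no linear solver is needed, and it uses a hypothetical three-harmonic signal with weak noise. All names, the signal, the noise level and the threshold are assumptions made here; the code lengths follow the computer-oriented scheme with 32-bit reals.

```python
import math
import random

N_POINTS = 512
REAL_BITS = 32  # assumed number of bits per real number

def fit_harmonics(y, p):
    """Least-squares fit of a_0 + sum_{j=1..p} [a_j cos(2 pi j x) + b_j sin(2 pi j x)]
    on the uniform grid x_i = i/n. The basis is orthogonal on this grid,
    so each coefficient is a projection (valid for p < n/2)."""
    n = len(y)
    x = [i / n for i in range(n)]
    model = [sum(y) / n] * n
    for j in range(1, p + 1):
        c = [math.cos(2.0 * math.pi * j * xi) for xi in x]
        s = [math.sin(2.0 * math.pi * j * xi) for xi in x]
        a = 2.0 / n * sum(yi * ci for yi, ci in zip(y, c))
        b = 2.0 / n * sum(yi * si for yi, si in zip(y, s))
        model = [m + a * ci + b * si for m, ci, si in zip(model, c, s)]
    return model

def mdl_length(y, p, threshold):
    """Model bits for the 2p + 1 coefficients plus data bits for every
    point whose residual exceeds the threshold."""
    model = fit_harmonics(y, p)
    n_bad = sum(1 for yi, mi in zip(y, model) if abs(yi - mi) > threshold)
    return (2 * p + 1) * REAL_BITS + n_bad * REAL_BITS

rng = random.Random(0)
xs = [i / N_POINTS for i in range(N_POINTS)]
# Hypothetical signal with harmonics 1, 3 and 5 plus weak Gaussian noise.
signal = [2.0 * math.sin(2.0 * math.pi * x) + math.cos(6.0 * math.pi * x)
          - 2.0 * math.cos(10.0 * math.pi * x) + rng.gauss(0.0, 0.02)
          for x in xs]

lengths = {p: mdl_length(signal, p, threshold=0.2) for p in range(0, 13)}
best_p = min(lengths, key=lengths.get)
print(best_p, lengths[best_p])  # 5 harmonics, 11 * 32 = 352 bits
```

Below p = 5 the missing harmonic leaves most points above the threshold, so the data term dominates; above p = 5 the data term is already zero and only the model term grows.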
3. APPLICATIONS

In this section we apply the MDL principle to two se- 
lected problems in the field of solar spectropolarimetry 
in order to show the potential of this technique for model 
selection. Our aim is to apply the MDL criterion system- 
atically to other similar problems in the future. 

3.1. PCA de-noising 

As a first application, we consider the case of Principal 
Component Analysis (PCA) de-noising of spectropolari-
metric observations. The dataset consists of full-Stokes
observations of the two Fe I lines at 15648 Å and 15652 Å
in an extremely quiet region of the solar surface. This
dataset has been used by Martínez González et al. (2006)
to investigate the magnetic properties of the quiet Sun.
The signal-to-noise ratio of the observations has been
improved by using the PCA de-noising technique.

PCA is a statistical technique that, given a data set, 
produces a set of orthogonal vectors and eigenvalues that 
can be used for decomposing the original data. These 
eigenvectors point in the directions of maximum covari- 
ance. The eigenvector associated with the largest eigen- 
value points along the direction with the largest covari- 
ance in the data and so on. PCA provides an "opti- 
mal" basis set for decomposing (and reconstructing) the 
data set. Alternatively, it has been used for compressing 
data (by saving only the eigenvalues and eigenvectors)
or for efficiently inverting Stokes profiles (Rees et al.).
For computing the PCA decomposition, the co-
variance matrix of the observations has to be diagonal-
ized. Once the eigenvectors $e_i(S)$ (vectors whose dimen-
sion $N_\lambda$ equals the number of wavelengths in the dataset)
and eigenvalues $\lambda_i(S)$ of the covariance matrix associ-
ated with the Stokes parameter $S$ are obtained, any of
the Stokes profiles can be decomposed as:

$$S(\lambda_j) = \sum_{i=0}^{N} \lambda_i(S)\, e_{i,j}(S), \tag{22}$$

where the subindex j refers to the wavelength position. 
The previous equation can be alternatively seen as a 
technique for reconstructing the original signal from the 
eigenvalues and eigenvectors obtained after the PCA de- 
composition. The precision of the reconstruction can be 
modified by changing the value of the number of eigen- 
vectors included (N). The PCA reconstruction assures 
that the error of the reconstruction decreases when more 
eigenvectors are included. The eigenvectors with the 
largest eigenvalues represent the features that are more 
statistically representative of the dataset, while the noise 
and particular features are accounted for by the eigen- 
vectors with the smallest eigenvalues. As a consequence, 
it is possible to get rid of the majority of the noise in the 
observation by stopping the summation of Eq. (22) at a
suitable N' < N. Nevertheless, it is not an easy task to 
find a criterion for selecting this N' because instrumental 
effects plus data reduction can introduce spurious signals 
that have some kind of correlation. As a consequence, 
they contribute to the eigenvectors that carry the rele- 
vant polarimetric information. 

We propose to use the MDL criterion to select this op-
timal $N'$. The experiment is carried out with the Stokes
$V$ profiles of the Fe I lines at 15648 Å and 15652 Å ob-
served in a very quiet internetwork region of the Sun
described by Martínez González et al. (2006). The orig-
inal data present a signal-to-noise ratio (SNR) of ~5
for the 15648 Å line and ~2 for the 15652 Å line. The
summation of Eq. (22) is calculated for increasing val-
ues of $N$ and the MDL length is calculated using the
technique described in §2.2. The length of the model
$L(M)$ is obtained by calculating the number of bits to
represent the eigenvectors $e_i$ with $i = 0, \ldots, N$ plus the
coefficients $\lambda_i$ for expanding all the profiles in the field of
view. The length of the data set given the model $L(D|M)$
is obtained by calculating the number of bits needed for
encoding the reconstructed profiles that differ from the
original profiles by more than a given threshold. The corresponding figure
shows these lengths versus $N$ for different values of this
threshold. The length of the model is plotted with a dot-
dashed line, the length of the data with a dashed line and
the total length with a solid line. We have also marked the
value of $N$ at which we obtain the minimum of the total
length. This is the MDL optimum value $N'$ for the num-
ber of eigenvectors. Note that this optimum value increases
when the allowed threshold decreases, a consequence of
imposing more restrictions on the model. Of course, this
threshold should be chosen consistent with the expected
noise in the observations.
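
One way to write the two lengths of this experiment under the computer-oriented scheme of §2.2 is given below. It is a sketch under stated assumptions: every eigenvector component and every expansion coefficient is stored with $l_r$ bits, the data term counts individual wavelength points (one possible reading of the description above), and $N_p$, the number of profiles in the field of view, and $\hat{S}_q$, the reconstruction of profile $q$, are labels introduced here.

```latex
% Sketch of the PCA message lengths, assuming l_r bits per stored real number.
% N_\lambda: wavelength points per profile; N_p: number of profiles (label introduced here).
L(M) \approx l_r \left[ (N+1)\,N_\lambda + (N+1)\,N_p \right],
\qquad
L(D|M) \approx l_r \,\#\left\{ (j,q) : \left| S_q(\lambda_j) - \hat{S}_q(\lambda_j) \right| > \delta \right\}.
```

The model term grows linearly with the number of retained eigenvectors, while the data term shrinks as the reconstruction improves, which is the trade-off that produces the minimum of the total length.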

3.2. LTE inversion 

The diagnostic of magnetic fields via the interpreta- 
tion of spectropolarimetric observations is often based 
on the assumption of a model. Sometimes, the obser- 
vations themselves do not carry enough information for 
discriminating among several models. A similar problem 
arises when a model can have an arbitrarily large num- 
ber of parameters and it is not an easy task to select an 
optimum value of such parameters. A commonly used 
technique to minimize over-fitting is to use models with 
as few parameters as possible. Nevertheless, the data 
may contain enough information for constraining more 
parameters and we may be using overly simple models 
to interpret the observations. 

We propose to use the MDL criterion to discriminate 
among different models that can be used to describe a set 
of observations. Our aim is to introduce to the commu-
nity a simple technique for comparing different models
when applied to the same data set. In the framework of
the MDL criterion, the researcher is able to objectively
select one of the models among the others, mak-
ing sure that the selected model explains the observations
without over-fitting them.

We demonstrate our approach by using the LTE inver-
sion code SIR (Stokes Inversion based on Response func-
tions; Ruiz Cobo & del Toro Iniesta 1992). An example
of the Stokes profiles of the Fe I lines at 15648 Å and
15652 Å observed in a very quiet internetwork region is
shown in Fig. 2 (Martínez González et al. 2006). The
inversion is carried out with a two-component model (a 
magnetic one occupying a fraction of the resolution ele- 
ment and a non-magnetic one filling up the rest of the 
space). The observations clearly show strongly distorted 
Stokes profiles that cannot be correctly reproduced with 
this simple two-component model. Although very sim- 
ple, this test demonstrates the capabilities of the MDL 
criterion to pick up a model when none of the models of 
the proposed set is able to correctly fit the observations. 
SIR represents the variation with depth of the thermo-
dynamical and magnetic properties with the aid of nodes
that are equidistant in the $\log \tau$ axis. It is of interest to
note that when a new node is included for representing 
the depth variation of a physical quantity, the previous 
nodes are shifted so that the final distribution is again 
equidistant. Splines are used to interpolate these quan- 
tities between the nodes. The number of nodes of the 
temperature (or the magnetic field strength) is increased 
and the MDL criterion is used for selecting the optimal 
values. We follow the prescriptions proposed in §2.2. The
length of the model is calculated as the number of bits to 
encode the number of parameters plus their values. The 
length of the data given the model is chosen to be equal 
to the number of bits necessary to encode the points in
the profile for which the relative difference between the 
model and the data is above a certain threshold. This 
threshold represents the tolerance we allow in our model 
for considering that it fits our observations. 
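
The message length used for the inversions can be added as an extra output of a fitting code with very little effort. The Python sketch below illustrates the bookkeeping only; it does not call SIR, and the function name, the default bit sizes and the sample profile values are assumptions made here.

```python
def inversion_message_length(observed, synthetic, n_params,
                             rel_threshold=0.02, int_bits=16, real_bits=32):
    """Two-part message length in bits for one fitted profile.

    Model part: the number of parameters (one integer) plus their values.
    Data part: one real number for every point whose relative difference
    between observed and synthetic profile exceeds rel_threshold.
    """
    model_bits = int_bits + n_params * real_bits
    # |obs - syn| / |obs| > threshold, written without a division.
    n_bad = sum(1 for obs, syn in zip(observed, synthetic)
                if abs(obs - syn) > rel_threshold * abs(obs))
    return model_bits + n_bad * real_bits

# Hypothetical normalized Stokes I values; only the third point misses by more than 2%.
observed = [1.00, 0.80, 0.50, 0.80, 1.00]
synthetic = [1.01, 0.80, 0.53, 0.79, 1.00]
print(inversion_message_length(observed, synthetic, n_params=2))  # 16 + 64 + 32 = 112
```

Running the same function on fits with different numbers of nodes and keeping the smallest value implements the comparison described above.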

The first experiment consists in selecting the optimal
number of nodes in the temperature when the number
of nodes in the rest of the variables is kept constant. We
only compare the observed and synthetic Stokes $I$ profiles
in this experiment because it is the only Stokes parame-
ter that is almost insensitive to the magnetic properties
of the atmosphere. This is not generally the case since
Stokes $I$ can also be sensitive to the magnetic field in
the strong field regime. In analysing internetwork quiet
Sun Stokes profiles, the filling factor of the magnetic
component needs to be very small. In such a situation,
the emergent Stokes $I$ profile fundamentally depends on
the properties of the non-magnetic component, while the
emergent Stokes $V$ profile depends only on the properties
of the magnetic component. We consider that a point in
the profile reproduces the observations if the relative er-
ror between the observed and the synthetic profiles is
below 2%. The results for the message length are shown
in the left panel of Fig. 5, while the fit is shown in the
left panel of the figure with the fitted profiles. The MDL criterion indicates
that ~2 nodes are required in the temperature depth
profile. Fewer nodes give a fit to the Stokes profiles that
is too poor. More nodes give a model that takes more
bits to communicate than the incorrectly fitted points
themselves.

The second experiment is similar, but in this case we 
select the number of nodes of the magnetic field strength. 
The threshold for the relative error in the Stokes I 
profiles is 2%, while we increase it to 10% for the Stokes V 
profiles. This larger tolerance for Stokes V is motivated 
by the poor fit that the simple two-component model gives 
to the observed Stokes profiles. 
The message length is shown in the right panel of Fig. 5, 
while the fit is shown in the right panel of Fig. 4. The 
optimum number of nodes for the magnetic field suggested 
by the MDL criterion is ~3. 
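
The selection step shared by both experiments can be sketched as follows. This is a minimal illustration under stated assumptions, not the SIR implementation: the helper names, the cost of 32 bits per real number and the simplified way in which the two parts of the message are added are hypothetical, and the toy profiles are made-up numbers rather than data from this paper. Only the two thresholds (2% for Stokes I and 10% for Stokes V) are taken from the text.

```python
BITS_PER_REAL = 32          # assumed cost of one real number
TOL_I, TOL_V = 0.02, 0.10   # relative-error thresholds quoted in the text


def misfit_count(observed, synthetic, tol):
    """Number of points whose relative error exceeds the tolerance."""
    return sum(1 for o, s in zip(observed, synthetic)
               if abs(o - s) > tol * abs(o))


def total_length(n_nodes, obs_i, syn_i, obs_v, syn_v):
    """Model bits plus data bits for one candidate (simplified cost model)."""
    model_bits = n_nodes * BITS_PER_REAL
    bad = misfit_count(obs_i, syn_i, TOL_I) + misfit_count(obs_v, syn_v, TOL_V)
    data_bits = bad * BITS_PER_REAL
    return model_bits + data_bits


def select_nodes(obs_i, obs_v, candidates):
    """candidates maps n_nodes -> (synthetic I, synthetic V)."""
    lengths = {n: total_length(n, obs_i, syn_i, obs_v, syn_v)
               for n, (syn_i, syn_v) in candidates.items()}
    best = min(lengths, key=lengths.get)
    return best, lengths


# Made-up toy profiles: one node misses every Stokes V point, two nodes
# fit within tolerance, three nodes fit exactly but cost more bits.
obs_i = [1.00, 0.90, 0.60, 0.90, 1.00]
obs_v = [0.010, 0.050, -0.050, -0.010]
candidates = {
    1: (obs_i, [0.020, 0.030, -0.030, -0.020]),
    2: (obs_i, [0.010, 0.048, -0.052, -0.010]),
    3: (obs_i, obs_v),
}
best, lengths = select_nodes(obs_i, obs_v, candidates)  # best == 2 here
```

Note that a relative tolerance is ill-defined wherever the observed Stokes V signal crosses zero; a practical implementation would need an absolute floor there, which this sketch omits.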

In this section we have presented a very simple problem 
concerning the selection of the optimal number of 
parameters in an LTE inversion. However, we consider 
that this approach will be of great interest for solving 
such a difficult problem in a simple way and we suggest 
calculating the MDL criterion for all the fits performed 
to an observed profile. 

4. CONCLUSIONS 

We have presented the Minimum Description Length 
Principle developed by Rissanen (1978) to discriminate 
among a set of available models that can approximate 
a given data set. We have briefly presented its relation 
with the Bayesian approach to model selection through 
the application of Shannon's theorem. In our opinion, 
the MDL principle provides a user-friendly procedure 
for model selection. We have presented simple ways of 
estimating the message length that can be applied for 
communicating integer and real numbers. A more 
computer-oriented and easy-to-implement procedure has 
also been shown. 

For the sake of clarity, we have applied the MDL 
principle to simplified problems: the selection of the 
optimal number of PCA components when de-noising 
observed Stokes profiles, and the selection of the optimal 
number of nodes in the temperature and magnetic field 
depth profiles obtained from the LTE inversion of Stokes 
profiles. The results show the potential of this technique 
for model selection, with the advantage of being very 
simple to calculate. 

As the main conclusion of this paper, we propose us- 
ing the MDL principle as a way to quantitatively select 
among different competing models. We propose to in- 
clude the description length as one of the final outputs 
of any inversion code. This makes it very easy to select 
the optimal model from a proposed set of models based 
on the framework of the MDL principle. It is of interest 
to stress that this principle can be used to select among 
models that are based on different scenarios. Usually, 
more complex scenarios translate into more degrees of 
freedom that may not be constrained by the observations. 
Using the MDL principle, the selected model among all 
the possibilities might not give the best fit to the obser- 
vations but it represents the model that produces a good 
fit with a conservative number of parameters. 

As a final remark, a possible avenue for future research 
is to investigate how the MDL principle can be implemented 
as a regularization of the merit function in existing 
inversion techniques. One would then end up with an 
inversion code that automatically selects the optimal 
model. 

We thank M. J. Martinez Gonzalez for helpful dis- 
cussions on the subject of the paper. We also thank 
Thornsten A. Carroll for his careful review and useful 
comments. This research has been funded by the Span- 
ish Ministerio de Educacion y Ciencia through project 
AYA2004-05792. 

Fig. 1. — Example showing the application of the MDL criterion for model selection. The aim is to fit a linear combination of sinusoidal 
signals with maximum frequency 14 using a Fourier series. The left panel shows the message length obtained using the computer-oriented 
code lengths. The right panel shows the results obtained using a Gaussian distribution for the probability density of the residual. Note that 
in both cases the MDL criterion gives the correct value of the maximum frequency. 

Fig. 2. — Eigenvalues obtained from the decomposition of the observed Stokes V profiles described by Martinez Gonzalez et al. (2006). 
Note the monotonic decay. The eigenvectors associated with the largest eigenvalues carry most of the signal (features that present strong 
correlations for a large set of observed profiles in the field-of-view), while the eigenvectors associated with the smallest eigenvalues are 
mainly associated with uncorrelated noise. 

Fig. 3. — Application of the MDL criterion to the de-noising of spectropolarimetric signals of quiet Sun observations. The length of the 
model (dot-dashed line), the data set given the model (dashed line) and the total length (solid line) are plotted versus the number of PCA 
components included in the data reconstruction for different values of the threshold that sets the precision of the reconstruction (shown in 
the title of each plot). The vertical dashed line indicates the approximate minimum of the total length curve. Note that the number of 
PCA components obtained with this MDL criterion increases as the threshold decreases. 

Fig. 4. — One of the observed Stokes I (left panel) and Stokes V (right panel) profiles is shown as circles. The fits obtained with 
the two-component model SIR inversion are shown as solid lines. Note that this model is not powerful enough to fit these strongly 
asymmetric profiles. Both fits have been obtained using 2 nodes in the temperature. The fit of the Stokes V profiles has been obtained 
using 4 nodes in the magnetic field strength depth profile. 

Fig. 5. — Application of the MDL criterion to the selection of the optimum number of nodes in an LTE inversion with the SIR code. The 
left panel shows the message length when only the number of nodes of the temperature depth profile is changed. The right panel shows 
the message length when the number of nodes of the magnetic field strength profile is changed. 



